{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# LlamaIndex: Text Chunking Strategies\n"
      ],
      "metadata": {
        "id": "1o6oVH_fNA0N"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "The aim is to get the data into a format where it can be used for anticipated tasks and retrieved for value later. Rather than asking “How should I chunk my data?”, the better question is “What is the optimal way to pass my language model the data it needs for its task?”\n",
        "\n",
        "\n",
        "This notebook demonstrates chunking strategies suited to different types of data, so that the resulting chunks carry meaning rather than being split for splitting's sake."
      ],
      "metadata": {
        "id": "Gu2pbLdq0O0L"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lCgSu4wS5L02",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "c9cef73c-2e7b-4415-f1e0-3dad04cb76c6"
      },
      "outputs": [],
      "source": [
        "# install required libraries\n",
        "!pip install llama_index tree_sitter tree_sitter_languages -q"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Data files for applying different chunking methods"
      ],
      "metadata": {
        "id": "unW0OVcp0-ZQ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Download sample data: a Markdown README and a plain-text speech transcript\n",
        "!wget https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/README.md\n",
        "!wget https://frontiernerds.com/files/state_of_the_union.txt"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xYx0PGZTD2xs",
        "outputId": "042bb5d1-cb90-45c9-8183-0c986193cdb1"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2024-04-15 10:06:43--  https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/README.md\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 29701 (29K) [text/plain]\n",
            "Saving to: ‘README.md’\n",
            "\n",
            "\rREADME.md             0%[                    ]       0  --.-KB/s               \rREADME.md           100%[===================>]  29.00K  --.-KB/s    in 0.002s  \n",
            "\n",
            "2024-04-15 10:06:43 (12.1 MB/s) - ‘README.md’ saved [29701/29701]\n",
            "\n",
            "--2024-04-15 10:06:43--  https://frontiernerds.com/files/state_of_the_union.txt\n",
            "Resolving frontiernerds.com (frontiernerds.com)... 172.67.180.189, 104.21.31.232, 2606:4700:3036::6815:1fe8, ...\n",
            "Connecting to frontiernerds.com (frontiernerds.com)|172.67.180.189|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: unspecified [text/plain]\n",
            "Saving to: ‘state_of_the_union.txt’\n",
            "\n",
            "state_of_the_union.     [ <=>                ]  39.91K  --.-KB/s    in 0.001s  \n",
            "\n",
            "2024-04-15 10:06:43 (64.5 MB/s) - ‘state_of_the_union.txt’ saved [40864]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## File-Based Node Parsers\n",
        "\n",
        "Several file-based parsers create nodes according to the type of content they are reading (JSON, Markdown, HTML, etc.).\n",
        "\n",
        "The easiest approach is to combine the FlatReader with the SimpleFileNodeParser, which automatically selects the right parser for each content type. You can then chain a text-based splitter on top to control the final chunk length."
      ],
      "metadata": {
        "id": "2olIZB2unXwz"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Node Parser - [Simple File](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#simplefilenodeparser)\n",
        "\n",
        "The SimpleFileNodeParser automatically selects the appropriate parser for each file type."
      ],
      "metadata": {
        "id": "HtW5uzownkVP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Simple File\n",
        "from llama_index.core.node_parser import SimpleFileNodeParser\n",
        "from llama_index.readers.file import FlatReader\n",
        "from pathlib import Path\n",
        "\n",
        "md_docs = FlatReader().load_data(Path(\"README.md\"))\n",
        "\n",
        "parser = SimpleFileNodeParser()\n",
        "\n",
        "# Additionally, you can augment this with a text-based parser to accurately handle text length\n",
        "md_nodes = parser.get_nodes_from_documents(md_docs)\n",
        "md_nodes[0].text"
      ],
      "metadata": {
        "id": "GqWdmhBdWrB4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        },
        "outputId": "172b7873-06cc-4a71-fd5b-acdcd24aeb0d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'VectorDB-recipes\\n<br />\\nDive into building GenAI applications!\\nThis repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects.\\n\\n- These are built using LanceDB, a free, open-source, serverless vectorDB that **requires no setup**. \\n- It **integrates into python data ecosystem** so you can simply start using these in your existing data pipelines in pandas, arrow, pydantic etc.\\n- LanceDB has **native Typescript SDK** using which you can **run vector search** in serverless functions!\\n\\n<img src=\"https://github.com/lancedb/vectordb-recipes/assets/5846846/d284accb-24b9-4404-8605-56483160e579\" height=\"85%\" width=\"85%\" />\\n\\n<br />\\nJoin our community for support - <a href=\"https://discord.gg/zMM32dvNtd\">Discord</a> •\\n<a href=\"https://twitter.com/lancedb\">Twitter</a>\\n\\n---\\n\\nThis repository is divided into 3 sections:\\n- [Examples](#examples) - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes!\\n- [Applications](#projects--applications) - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools\\n- [Tutorials](#tutorials) - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Node Parser - [HTML](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#htmlnodeparser)\n",
        "\n",
        "This node parser uses Beautiful Soup to parse raw HTML.\n",
        "\n",
        "By default, it will parse a select subset of HTML tags, but you can override this.\n",
        "\n",
        "The default tags are: [\"p\", \"h1\", \"h2\", \"h3\", \"h4\", \"h5\", \"h6\", \"li\", \"b\", \"i\", \"u\", \"section\"]"
      ],
      "metadata": {
        "id": "-au7BAS2nvBC"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "#  HTML\n",
        "\n",
        "import requests\n",
        "from llama_index.core import Document\n",
        "from llama_index.core.node_parser import HTMLNodeParser\n",
        "\n",
        "# URL of the website to fetch HTML from\n",
        "url = \"https://www.utoronto.ca/\"\n",
        "\n",
        "# Send a GET request to the URL\n",
        "response = requests.get(url)\n",
        "print(response)\n",
        "\n",
        "# Check if the request was successful (status code 200)\n",
        "if response.status_code == 200:\n",
        "    # Extract the HTML content from the response\n",
        "    html_doc = response.text\n",
        "    document = Document(id_=url, text=html_doc)\n",
        "\n",
        "    parser = HTMLNodeParser(tags=[\"p\", \"h1\"])\n",
        "    nodes = parser.get_nodes_from_documents([document])\n",
        "    print(nodes)\n",
        "else:\n",
        "    # Print an error message if the request was unsuccessful\n",
        "    print(\"Failed to fetch HTML content:\", response.status_code)"
      ],
      "metadata": {
        "id": "Zhe7xYJtXw4l",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "7f32b9e5-225e-4a3e-b9a0-9a6287615f5a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "<Response [200]>\n",
            "[TextNode(id_='bf308ea9-b937-4746-8645-c8023e2087d7', embedding=None, metadata={'tag': 'h1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='247fb639a05bc6898fd1750072eceb47511d3b8dae80999f9438e50a1faeb4b2'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='7c280bdf-7373-4be8-8e70-6360848581e9', node_type=<ObjectType.TEXT: '1'>, metadata={'tag': 'p'}, hash='3e989bb32b04814d486ed9edeefb1b0ce580ba7fc8c375f64473ddd95ca3e824')}, text='Welcome to University of Toronto', start_char_idx=2784, end_char_idx=2816, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), TextNode(id_='7c280bdf-7373-4be8-8e70-6360848581e9', embedding=None, metadata={'tag': 'p'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='247fb639a05bc6898fd1750072eceb47511d3b8dae80999f9438e50a1faeb4b2'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='bf308ea9-b937-4746-8645-c8023e2087d7', node_type=<ObjectType.TEXT: '1'>, metadata={'tag': 'h1'}, hash='e1e6af749b6a40a4055c80ca6b821ed841f1d20972e878ca1881e508e4446c26')}, text='In photos: Under cloudy skies, U of T community gathers to experience near-total solar eclipse\\nYour guide to the U of T community\\nThe University of Toronto is home to some of the world’s top faculty, students, alumni and staff. 
U of T Celebrates recognizes their award-winning accomplishments.\\nDavid Dyzenhaus recognized with Gold Medal from Social Sciences and Humanities Research Council\\nOur latest issue is all about feeling good: the only diet you really need to know about, the science behind cold plunges, a uniquely modern way to quit smoking, the “sex, drugs and rock ‘n’ roll” of university classes, how to become a better workplace leader, and more.\\nFaculty and Staff\\nHis course about the body is a workout for the mind\\nProfessor Doug Richards teaches his students the secret to living a longer – and healthier – life\\n\\nStatement of Land Acknowledgement\\nWe wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and the Mississaugas of the Credit. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.\\nRead about U of T’s Statement of Land Acknowledgement.\\nUNIVERSITY OF TORONTO - SINCE 1827', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Node Parser - [JSON](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#jsonnodeparser)\n",
        "The JSONNodeParser parses raw JSON."
      ],
      "metadata": {
        "id": "rbn4Rvt-n4Zr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# JSON\n",
        "\n",
        "from llama_index.core.node_parser import JSONNodeParser\n",
        "\n",
        "url = \"https://housesigma.com/bkv2/api/search/address_v2/suggest\"\n",
        "\n",
        "payload = {\"lang\": \"en_US\", \"province\": \"ON\", \"search_term\": \"Mississauga, ontario\"}\n",
        "\n",
        "headers = {\"Authorization\": \"Bearer 20240127frk5hls1ba07nsb8idfdg577qa\"}\n",
        "\n",
        "response = requests.post(url, headers=headers, data=payload)\n",
        "\n",
        "if response.status_code == 200:\n",
        "    document = Document(id_=url, text=response.text)\n",
        "    parser = JSONNodeParser()\n",
        "\n",
        "    nodes = parser.get_nodes_from_documents([document])\n",
        "    print(nodes[0])\n",
        "else:\n",
        "    print(\"Failed to fetch JSON content:\", response.status_code)"
      ],
      "metadata": {
        "id": "CW8pTEsEYdgL",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "28dae2de-f880-4874-95bb-5de82a716019"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Node ID: 05325093-16a2-41ac-b952-3882c817ac4d\n",
            "Text: status True data house_list id_listing owJKR7PNnP9YXeLP data\n",
            "house_list house_type_in_map D data house_list price_abbr 0.75M data\n",
            "house_list price 749,000 data house_list price_sold 690,000 data\n",
            "house_list tags Sold data house_list list_status public 1 data\n",
            "house_list list_status live 0 data house_list list_status s_r Sale\n",
            "data house_list list_s...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Node Parser - [Markdown](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#markdownnodeparser)\n",
        "The MarkdownNodeParser parses raw markdown text."
      ],
      "metadata": {
        "id": "VYBTqzmJn9Z5"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Markdown\n",
        "from llama_index.core.node_parser import MarkdownNodeParser\n",
        "\n",
        "md_docs = FlatReader().load_data(Path(\"README.md\"))\n",
        "parser = MarkdownNodeParser()\n",
        "\n",
        "nodes = parser.get_nodes_from_documents(md_docs)\n",
        "nodes[0].text"
      ],
      "metadata": {
        "id": "55f43LgJYkok",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        },
        "outputId": "3b9a3865-0a58-4e53-cff5-17cb8d024631"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'VectorDB-recipes\\n<br />\\nDive into building GenAI applications!\\nThis repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects.\\n\\n- These are built using LanceDB, a free, open-source, serverless vectorDB that **requires no setup**. \\n- It **integrates into python data ecosystem** so you can simply start using these in your existing data pipelines in pandas, arrow, pydantic etc.\\n- LanceDB has **native Typescript SDK** using which you can **run vector search** in serverless functions!\\n\\n<img src=\"https://github.com/lancedb/vectordb-recipes/assets/5846846/d284accb-24b9-4404-8605-56483160e579\" height=\"85%\" width=\"85%\" />\\n\\n<br />\\nJoin our community for support - <a href=\"https://discord.gg/zMM32dvNtd\">Discord</a> •\\n<a href=\"https://twitter.com/lancedb\">Twitter</a>\\n\\n---\\n\\nThis repository is divided into 3 sections:\\n- [Examples](#examples) - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes!\\n- [Applications](#projects--applications) - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools\\n- [Tutorials](#tutorials) - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 10
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Text Splitters\n",
        "\n",
        "Download a `.py` file to demonstrate code-aware splitting"
      ],
      "metadata": {
        "id": "gCFoPc1PZFI5"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Download for running Code Splitting\n",
        "!wget https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/applications/talk-with-podcast/app.py"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "rVkeWuwvDwu-",
        "outputId": "1ddb950a-0c0f-4be4-88fd-3029d53e6640"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2024-04-15 10:22:58--  https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/applications/talk-with-podcast/app.py\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 1582 (1.5K) [text/plain]\n",
            "Saving to: ‘app.py’\n",
            "\n",
            "\rapp.py                0%[                    ]       0  --.-KB/s               \rapp.py              100%[===================>]   1.54K  --.-KB/s    in 0s      \n",
            "\n",
            "2024-04-15 10:22:58 (12.1 MB/s) - ‘app.py’ saved [1582/1582]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### [Code Splitting](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#codesplitter)\n",
        "\n",
        "The CodeSplitter splits raw code text according to the language it is written in.\n",
        "\n",
        "Check the full list of [supported languages here](https://github.com/grantjenks/py-tree-sitter-languages#license)."
      ],
      "metadata": {
        "id": "spibLOthoCsK"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Code Splitting\n",
        "\n",
        "from llama_index.core.node_parser import CodeSplitter\n",
        "\n",
        "documents = FlatReader().load_data(Path(\"app.py\"))\n",
        "splitter = CodeSplitter(\n",
        "    language=\"python\",\n",
        "    chunk_lines=40,  # lines per chunk\n",
        "    chunk_lines_overlap=15,  # lines overlap between chunks\n",
        "    max_chars=1500,  # max chars per chunk\n",
        ")\n",
        "nodes = splitter.get_nodes_from_documents(documents)\n",
        "nodes[0].text"
      ],
      "metadata": {
        "id": "IDoDzDeiYqpL",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 140
        },
        "outputId": "5ac4578a-c5de-4060-cf9c-420a9078652b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/tree_sitter/__init__.py:36: FutureWarning: Language(path, name) is deprecated. Use Language(ptr, name) instead.\n",
            "  warn(\"{} is deprecated. Use {} instead.\".format(old, new), FutureWarning)\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'from youtube_podcast_download import podcast_audio_retreival\\nfrom transcribe_podcast import transcribe\\nfrom chat_retreival import retrieverSetup, chat\\nfrom langroid_utils import configure, agent\\n\\nimport os\\nimport glob\\nimport json\\nimport streamlit as st\\n\\nOPENAI_KEY = os.environ[\"OPENAI_API_KEY\"]\\n\\n\\n@st.cache_resource\\ndef video_data_retreival(framework):\\n    f = open(\"output.json\")\\n    data = json.load(f)\\n\\n    # setting up reteriver\\n    if framework == \"Langchain\":\\n        qa = retrieverSetup(data[\"text\"], OPENAI_KEY)\\n        return qa\\n    elif framework == \"Langroid\":\\n        langroid_file = open(\"langroid_doc.txt\", \"w\")  # write mode\\n        langroid_file.write(data[\"text\"])\\n        cfg = configure(\"langroid_doc.txt\")\\n        return cfg\\n\\n\\nst.header(\"Talk with Youtube Podcasts\", divider=\"rainbow\")\\n\\nurl = st.text_input(\"Youtube Link\")\\nframework = st.radio(\\n    \"**Select Framework 👇**\",\\n    [\"Langchain\", \"Langroid\"],\\n    key=\"Langchain\",\\n)\\n\\nif url:\\n    st.video(url)\\n    # Podcast Audio Retreival from Youtube\\n    podcast_audio_retreival(url)\\n\\n    # Trascribing podcast audio\\n    filename = glob.glob(\"*.mp3\")[0]\\n    transcribe(filename)\\n\\n    st.markdown(f\"##### `{framework}` Framework Selected for talking with Podcast\")\\n    # Chat Agent getting ready\\n    qa = video_data_retreival(framework)\\n\\n\\nprompt = st.chat_input(\"Talk with Podcast\")\\n\\ni'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### [Sentence Splitting](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#sentencesplitter)\n",
        "The SentenceSplitter attempts to split text while respecting the boundaries of sentences."
      ],
      "metadata": {
        "id": "On40RuqBoGNL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Sentence Splitting\n",
        "\n",
        "from llama_index.core.node_parser import SentenceSplitter\n",
        "\n",
        "documents = FlatReader().load_data(Path(\"state_of_the_union.txt\"))\n",
        "splitter = SentenceSplitter(\n",
        "    chunk_size=254,\n",
        "    chunk_overlap=20,\n",
        ")\n",
        "nodes = splitter.get_nodes_from_documents(documents)\n",
        "nodes[0].text"
      ],
      "metadata": {
        "id": "iNKuiCNrZOHl",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        },
        "outputId": "47064bf4-4079-42df-83b6-d519ba92a135"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "\"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\\n\\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\\n\\nAgain, we are tested. And again, we must answer history's call.\""
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 15
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Node Parser - [Sentence Window](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#sentencewindownodeparser)\n",
        "\n",
        "The SentenceWindowNodeParser is similar to other node parsers, except that it splits all documents into individual sentences. Each resulting node also stores the surrounding \"window\" of sentences in its metadata. Note that this metadata is not visible to the LLM or embedding model.\n",
        "\n",
        "This is most useful for generating embeddings with a very specific scope. Combined with a MetadataReplacementNodePostProcessor, you can then replace the sentence with its surrounding context before sending the node to the LLM.\n",
        "\n",
        "An example of setting up the parser is below. In practice, you usually only need to adjust the window size, i.e. the number of surrounding sentences to keep."
      ],
      "metadata": {
        "id": "h5zc7_YmoJHP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# SentenceWindowNodeParser\n",
        "\n",
        "import nltk\n",
        "from llama_index.core.node_parser import SentenceWindowNodeParser\n",
        "\n",
        "node_parser = SentenceWindowNodeParser.from_defaults(\n",
        "    window_size=3,\n",
        "    window_metadata_key=\"window\",\n",
        "    original_text_metadata_key=\"original_sentence\",\n",
        ")\n",
        "nodes = node_parser.get_nodes_from_documents(documents)\n",
        "nodes[0].text"
      ],
      "metadata": {
        "id": "76tbzUrMZRFF",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        },
        "outputId": "7f57513d-da7b-45f9-96c0-5698e06f1562"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. '"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Node Parser - [Semantic Splitting](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#semanticsplitternodeparser)\n",
        "\"Semantic chunking\" is a concept proposed by Greg Kamradt in his video tutorial on the five levels of embedding-based chunking: https://youtu.be/8OJC21T2SL4?t=1933.\n",
        "\n",
        "Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoint in-between sentences using embedding similarity. This ensures that a \"chunk\" contains sentences that are semantically related to each other."
      ],
      "metadata": {
        "id": "zj45BKLMoRgp"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# SemanticSplitterNodeParser\n",
        "\n",
        "from llama_index.core.node_parser import SemanticSplitterNodeParser\n",
        "from llama_index.embeddings.openai import OpenAIEmbedding\n",
        "import os\n",
        "\n",
        "# Add OpenAI API key as environment variable\n",
        "os.environ[\"OPENAI_API_KEY\"] = \"sk-****\"\n",
        "\n",
        "embed_model = OpenAIEmbedding()\n",
        "splitter = SemanticSplitterNodeParser(\n",
        "    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model\n",
        ")\n",
        "\n",
        "nodes = splitter.get_nodes_from_documents(documents)\n",
        "nodes[0].text"
      ],
      "metadata": {
        "id": "wAp7BU25ZdRt",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        },
        "outputId": "3f51c53b-7617-4c67-c247-125f7a6b84be"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. '"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 17
        }
      ]
    },
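    {
      "cell_type": "markdown",
      "source": [
        "The breakpoint logic can be illustrated with a small self-contained sketch. The sentences and 2-d 'embeddings' below are made up for illustration (not output from a real embedding model): compute a distance between each adjacent pair of sentence embeddings, then break wherever the distance clears the chosen percentile.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "def cosine_distance(a, b):\n",
        "    dot = sum(x * y for x, y in zip(a, b))\n",
        "    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))\n",
        "    return 1 - dot / norm\n",
        "\n",
        "def semantic_split(sentences, embeddings, percentile=95):\n",
        "    # Distance between each adjacent pair of sentence embeddings\n",
        "    dists = [cosine_distance(embeddings[i], embeddings[i + 1])\n",
        "             for i in range(len(embeddings) - 1)]\n",
        "    # Breakpoint threshold: the given percentile of those distances\n",
        "    cutoff = sorted(dists)[min(len(dists) - 1, int(len(dists) * percentile / 100))]\n",
        "    chunks, current = [], [sentences[0]]\n",
        "    for i, dist in enumerate(dists):\n",
        "        if dist >= cutoff:\n",
        "            chunks.append(' '.join(current))\n",
        "            current = []\n",
        "        current.append(sentences[i + 1])\n",
        "    chunks.append(' '.join(current))\n",
        "    return chunks\n",
        "\n",
        "# Sentences 1-2 share a topic; sentence 3 switches topics\n",
        "sents = ['Dogs bark.', 'Puppies play.', 'Stocks fell today.']\n",
        "embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]\n",
        "print(semantic_split(sents, embs, percentile=50))\n",
        "```\n",
        "\n",
        "The real splitter works the same way in spirit, with model embeddings, a sentence buffer (buffer_size), and breakpoint_percentile_threshold playing the role of percentile above."
      ],
      "metadata": {}
    },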
    {
      "cell_type": "markdown",
      "source": [
        "### [Token Text Splitting](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#tokentextsplitter)\n",
        "\n",
        "The TokenTextSplitter attempts to split text into chunks of a consistent size, measured in raw token counts."
      ],
      "metadata": {
        "id": "vH9xni1SoWYE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# TokenTextSplitter\n",
        "\n",
        "from llama_index.core.node_parser import TokenTextSplitter\n",
        "\n",
        "splitter = TokenTextSplitter(\n",
        "    chunk_size=254,\n",
        "    chunk_overlap=20,\n",
        "    separator=\" \",\n",
        ")\n",
        "nodes = splitter.get_nodes_from_documents(documents)\n",
        "nodes[0].text"
      ],
      "metadata": {
        "id": "9G61og__Ziec",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        },
        "outputId": "5b11b5f9-94d1-4b6d-a7d5-a58025f58f2a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "\"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\\n\\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\\n\\nAgain, we are tested. And again, we must answer history's call.\\n\\nOne year ago, I took office amid two wars, an economy\""
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 18
        }
      ]
    },
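    {
      "cell_type": "markdown",
      "source": [
        "The sliding-window arithmetic behind token splitting can be sketched without a tokenizer. This toy version treats whitespace-separated words as 'tokens' (the real TokenTextSplitter counts tokenizer tokens); the window advances by chunk_size - chunk_overlap each step.\n",
        "\n",
        "```python\n",
        "def token_split(text, chunk_size, chunk_overlap):\n",
        "    assert chunk_overlap < chunk_size\n",
        "    tokens = text.split()  # whitespace 'tokens' stand in for real ones\n",
        "    stride = chunk_size - chunk_overlap  # how far the window advances\n",
        "    chunks = []\n",
        "    for start in range(0, len(tokens), stride):\n",
        "        chunks.append(' '.join(tokens[start:start + chunk_size]))\n",
        "        if start + chunk_size >= len(tokens):\n",
        "            break\n",
        "    return chunks\n",
        "\n",
        "words = ' '.join(str(i) for i in range(10))  # a 10-token toy document\n",
        "# Each chunk starts with the last token of the previous one\n",
        "print(token_split(words, chunk_size=4, chunk_overlap=1))\n",
        "```"
      ],
      "metadata": {}
    },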
    {
      "cell_type": "markdown",
      "source": [
        "## Relation-Based Node Parser"
      ],
      "metadata": {
        "id": "rpbqXxeaawOt"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Node Parser - [Hierarchical](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#hierarchicalnodeparser)\n",
        "This node parser chunks documents into a hierarchy of nodes: a single input is split at several chunk sizes, with each node holding a reference to its parent node.\n",
        "\n",
        "When combined with the AutoMergingRetriever, this enables us to automatically replace retrieved nodes with their parents when a majority of children are retrieved. This process provides the LLM with more complete context for response synthesis."
      ],
      "metadata": {
        "id": "z_HuwzzAoabc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# HierarchicalNodeParser\n",
        "\n",
        "from llama_index.core.node_parser import HierarchicalNodeParser\n",
        "\n",
        "node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[512, 254, 128])\n",
        "\n",
        "nodes = node_parser.get_nodes_from_documents(documents)\n",
        "nodes[0].text"
      ],
      "metadata": {
        "id": "qwZFEDlaZpKT",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        },
        "outputId": "12888840-daf5-45f5-f934-e253b6036621"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "\"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\\n\\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\\n\\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\\n\\nAgain, we are tested. And again, we must answer history's call.\\n\\nOne year ago, I took office amid two wars, an economy rocked by severe recession, a financial system on the verge of collapse and a government deeply in debt. Experts from across the political spectrum warned that if we did not act, we might face a second depression. So we acted immediately and aggressively. And one year later, the worst of the storm has passed.\\n\\nBut the devastation remains. One in 10 Americans still cannot find work. Many businesses have shuttered. Home values have declined. Small towns and rural communities have been hit especially hard. For those who had already known poverty, life has become that much harder.\\n\\nThis recession has also compounded the burdens that America's families have been dealing with for decades -- the burden of working harder and longer for less, of being unable to save enough to retire or help kids with college.\\n\\nSo I know the anxieties that are out there right now. They're not new. These struggles are the reason I ran for president. These struggles are what I've witnessed for years in places like Elkhart, Ind., and Galesburg, Ill. I hear about them in the letters that I read each night.\""
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 19
        }
      ]
    },
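    {
      "cell_type": "markdown",
      "source": [
        "A minimal sketch of both ideas, using made-up helper names rather than the llama_index API: record a parent reference while splitting at successive sizes, then apply the merge rule, i.e. swap a group of retrieved children for their parent once a majority of that parent's children were retrieved.\n",
        "\n",
        "```python\n",
        "from collections import defaultdict\n",
        "\n",
        "def build_hierarchy(text, sizes):\n",
        "    # Returns {node_id: (chunk_text, parent_id)}; ids assigned depth-first\n",
        "    nodes = {}\n",
        "    next_id = 0\n",
        "\n",
        "    def split(piece, level, parent):\n",
        "        nonlocal next_id\n",
        "        size = sizes[level]\n",
        "        for start in range(0, len(piece), size):\n",
        "            sub = piece[start:start + size]\n",
        "            node_id = next_id\n",
        "            next_id += 1\n",
        "            nodes[node_id] = (sub, parent)\n",
        "            if level + 1 < len(sizes):\n",
        "                split(sub, level + 1, node_id)\n",
        "\n",
        "    split(text, 0, None)\n",
        "    return nodes\n",
        "\n",
        "def auto_merge(retrieved_ids, nodes, threshold=0.5):\n",
        "    # Swap children for their parent when a majority of siblings match\n",
        "    by_parent = defaultdict(list)\n",
        "    for node_id in retrieved_ids:\n",
        "        by_parent[nodes[node_id][1]].append(node_id)\n",
        "    merged = []\n",
        "    for parent, kids in by_parent.items():\n",
        "        siblings = [n for n, (_, p) in nodes.items() if p == parent]\n",
        "        if parent is not None and len(kids) / len(siblings) > threshold:\n",
        "            merged.append(parent)\n",
        "        else:\n",
        "            merged.extend(kids)\n",
        "    return sorted(merged)\n",
        "\n",
        "nodes = build_hierarchy('abcdefgh', sizes=[8, 4, 2])\n",
        "print(auto_merge([2, 3, 5], nodes))\n",
        "```\n",
        "\n",
        "In llama_index this bookkeeping lives in node relationships, and AutoMergingRetriever applies the majority-merge rule at retrieval time."
      ],
      "metadata": {}
    },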
    {
      "cell_type": "markdown",
      "source": [
        "# LangChain Text Chunking Strategies"
      ],
      "metadata": {
        "id": "64yUhjV_a9dk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# install required packages and modules\n",
        "!pip install -qU langchain-text-splitters\n",
        "!pip install requests"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8OijyBvqbCLC",
        "outputId": "8c5d46c4-0435-4f56-fe50-34c28d7846fd"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m287.5/287.5 kB\u001b[0m \u001b[31m5.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m113.0/113.0 kB\u001b[0m \u001b[31m11.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.0/53.0 kB\u001b[0m \u001b[31m6.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m144.8/144.8 kB\u001b[0m \u001b[31m11.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.31.0)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.3.2)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests) (3.6)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests) (2.0.7)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests) (2024.2.2)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Text Splitting - [Character](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html#charactertextsplitter)\n",
        "\n",
        "Splits text on a given separator, measuring chunk length in characters."
      ],
      "metadata": {
        "id": "QVKmz3Rvok9Y"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Split with Character\n",
        "\n",
        "with open(\"state_of_the_union.txt\") as f:\n",
        "    state_of_the_union = f.read()\n",
        "\n",
        "\n",
        "from langchain_text_splitters import CharacterTextSplitter\n",
        "\n",
        "text_splitter = CharacterTextSplitter(\n",
        "    separator=\"\\n\\n\",\n",
        "    chunk_size=1000,\n",
        "    chunk_overlap=200,\n",
        "    length_function=len,\n",
        "    is_separator_regex=False,\n",
        ")\n",
        "\n",
        "texts = text_splitter.create_documents([state_of_the_union])\n",
        "print(texts[0].page_content)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "EdjiGEitI4La",
        "outputId": "29bb4bfd-e198-4902-e7c0-40c6df7d488b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "WARNING:langchain_text_splitters.base:Created a chunk of size 1163, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 1015, which is longer than the specified 1000\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n",
            "\n",
            "Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\n"
          ]
        }
      ]
    },
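    {
      "cell_type": "markdown",
      "source": [
        "The warnings above come from how character splitting works: split on the separator first, then greedily merge pieces back together up to chunk_size. A hedged sketch (chunk_overlap handling omitted for brevity) shows why a single paragraph longer than chunk_size is emitted, and warned about, as-is.\n",
        "\n",
        "```python\n",
        "def char_split(text, separator, chunk_size):\n",
        "    pieces = text.split(separator)\n",
        "    chunks, current = [], ''\n",
        "    for piece in pieces:\n",
        "        candidate = piece if not current else current + separator + piece\n",
        "        if len(candidate) <= chunk_size:\n",
        "            current = candidate  # still fits: keep merging\n",
        "        else:\n",
        "            if current:\n",
        "                chunks.append(current)\n",
        "            # A piece longer than chunk_size becomes a chunk as-is:\n",
        "            # this is the case the splitter warns about\n",
        "            current = piece\n",
        "    if current:\n",
        "        chunks.append(current)\n",
        "    return chunks\n",
        "\n",
        "doc = 'aaaa--bb--cccccccccc--dd'\n",
        "print(char_split(doc, separator='--', chunk_size=8))\n",
        "```"
      ],
      "metadata": {}
    },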
    {
      "cell_type": "markdown",
      "source": [
        "### Text Splitting - [Recursive Character](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)\n",
        "\n",
        "Splits text by recursively looking at characters.\n",
        "\n",
        "It tries a list of separator characters in order, recursing until it finds one that produces chunks small enough.\n",
        "\n"
      ],
      "metadata": {
        "id": "ivNYVKPZowKh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Recursive Split Character\n",
        "\n",
        "# This is a long document we can split up.\n",
        "with open(\"state_of_the_union.txt\") as f:\n",
        "    state_of_the_union = f.read()\n",
        "\n",
        "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
        "\n",
        "text_splitter = RecursiveCharacterTextSplitter(\n",
        "    # Chunk size in characters; overlap preserves context across boundaries\n",
        "    chunk_size=1000,\n",
        "    chunk_overlap=100,\n",
        "    length_function=len,\n",
        "    is_separator_regex=False,\n",
        ")\n",
        "\n",
        "texts = text_splitter.create_documents([state_of_the_union])\n",
        "print(\"Chunk 2: \", texts[1].page_content)\n",
        "print(\"Chunk 3: \", texts[2].page_content)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9X6_duxwN3nI",
        "outputId": "d6ce1302-1a9a-4887-9505-1c206390ab2f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Chunk 2:  It's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\n",
            "\n",
            "Again, we are tested. And again, we must answer history's call.\n",
            "Chunk 3:  Again, we are tested. And again, we must answer history's call.\n",
            "\n",
            "One year ago, I took office amid two wars, an economy rocked by severe recession, a financial system on the verge of collapse and a government deeply in debt. Experts from across the political spectrum warned that if we did not act, we might face a second depression. So we acted immediately and aggressively. And one year later, the worst of the storm has passed.\n",
            "\n",
            "But the devastation remains. One in 10 Americans still cannot find work. Many businesses have shuttered. Home values have declined. Small towns and rural communities have been hit especially hard. For those who had already known poverty, life has become that much harder.\n",
            "\n",
            "This recession has also compounded the burdens that America's families have been dealing with for decades -- the burden of working harder and longer for less, of being unable to save enough to retire or help kids with college.\n"
          ]
        }
      ]
    },
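    {
      "cell_type": "markdown",
      "source": [
        "The recursion itself fits in a few lines. This sketch uses toy separators ('==' then a space) instead of the real default list (paragraph breaks, newlines, spaces, characters), and omits the re-merging of small neighbouring pieces that the real splitter performs: any piece still longer than chunk_size falls through to the next separator.\n",
        "\n",
        "```python\n",
        "def recursive_split(text, separators, chunk_size):\n",
        "    if len(text) <= chunk_size or not separators:\n",
        "        return [text]\n",
        "    sep, rest = separators[0], separators[1:]\n",
        "    chunks = []\n",
        "    for piece in text.split(sep):\n",
        "        if len(piece) <= chunk_size:\n",
        "            chunks.append(piece)\n",
        "        else:\n",
        "            # Still too big: retry this piece with the next separator\n",
        "            chunks.extend(recursive_split(piece, rest, chunk_size))\n",
        "    return chunks\n",
        "\n",
        "text = 'one two==three four five==six'\n",
        "print(recursive_split(text, ['==', ' '], chunk_size=9))\n",
        "```"
      ],
      "metadata": {}
    },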
    {
      "cell_type": "markdown",
      "source": [
        "### Text Splitting - [HTML Header](https://python.langchain.com/api_reference/text_splitters/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html#htmlheadertextsplitter)\n",
        "\n",
        "Splitting HTML files based on specified headers.\n",
        "\n",
        "Requires the lxml package."
      ],
      "metadata": {
        "id": "I1nKMkm4o1Ft"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Split with HTML Tags\n",
        "\n",
        "from langchain_text_splitters import HTMLHeaderTextSplitter\n",
        "import requests\n",
        "\n",
        "# URL of the website to fetch HTML from\n",
        "url = \"https://www.utoronto.ca/\"\n",
        "\n",
        "# Send a GET request to the URL\n",
        "response = requests.get(url)\n",
        "response.raise_for_status()  # fail loudly instead of leaving html_doc undefined\n",
        "html_doc = response.text\n",
        "\n",
        "headers_to_split_on = [\n",
        "    (\"h1\", \"Header 1\"),\n",
        "    (\"h2\", \"Header 2\"),\n",
        "    (\"h3\", \"Header 3\"),\n",
        "]\n",
        "\n",
        "html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
        "html_header_splits = html_splitter.split_text(html_doc)\n",
        "html_header_splits[0].page_content"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "671M7BEVJ5zL",
        "outputId": "d6d5df83-98f5-42e5-c50a-de7455a46b93"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'Welcome to University of Toronto  \\nMain menu tools'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 29
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Text Splitting - Code"
      ],
      "metadata": {
        "id": "y8utGi0No6tr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Code Splitting\n",
        "\n",
        "from langchain_text_splitters import Language, RecursiveCharacterTextSplitter\n",
        "\n",
        "\n",
        "with open(\"app.py\") as f:\n",
        "    code = f.read()\n",
        "\n",
        "python_splitter = RecursiveCharacterTextSplitter.from_language(\n",
        "    language=Language.PYTHON, chunk_size=100, chunk_overlap=0\n",
        ")\n",
        "python_docs = python_splitter.create_documents([code])\n",
        "python_docs[0].page_content"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "9nfbAYj1KGQL",
        "outputId": "92edafe3-7d0c-4e20-90ea-5111c565b232"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'from youtube_podcast_download import podcast_audio_retreival'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 33
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Text Splitting - [Recursive JSON](https://python.langchain.com/api_reference/text_splitters/json/langchain_text_splitters.json.RecursiveJsonSplitter.html#recursivejsonsplitter)\n",
        "\n",
        "Splits JSON data into smaller, structured chunks while preserving hierarchy.\n",
        "\n",
        "This method splits JSON data into smaller dictionaries or JSON-formatted strings based on configurable maximum and minimum chunk sizes. It supports nested JSON structures, optionally converts lists into dictionaries for better chunking, and allows the creation of document objects for further use."
      ],
      "metadata": {
        "id": "tePMloUspEcX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Recursive Split Json\n",
        "\n",
        "from langchain_text_splitters import RecursiveJsonSplitter\n",
        "import json\n",
        "import requests\n",
        "\n",
        "json_data = requests.get(\"https://api.smith.langchain.com/openapi.json\").json()\n",
        "\n",
        "splitter = RecursiveJsonSplitter(max_chunk_size=300)\n",
        "json_chunks = splitter.split_json(json_data=json_data)\n",
        "json_chunks[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8bW_6wkmMAoR",
        "outputId": "73dadc8f-30bc-491f-c3f5-a95e75486971"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'openapi': '3.1.0',\n",
              " 'info': {'title': 'LangSmith', 'version': '0.1.0'},\n",
              " 'servers': [{'url': 'https://api.smith.langchain.com',\n",
              "   'description': 'LangSmith API endpoint.'}]}"
            ]
          },
          "metadata": {},
          "execution_count": 33
        }
      ]
    },
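    {
      "cell_type": "markdown",
      "source": [
        "The core idea can be sketched with the standard json module (the nested doc below is toy data, not the real LangSmith response): serialize a value and keep it whole if it fits under max_chunk_size, otherwise recurse into its keys, carrying the key path so the hierarchy is preserved.\n",
        "\n",
        "```python\n",
        "import json\n",
        "\n",
        "def json_split(data, max_chunk_size, path=()):\n",
        "    # Keep a value whole if its JSON text fits, else recurse into keys\n",
        "    if len(json.dumps(data)) <= max_chunk_size or not isinstance(data, dict):\n",
        "        return [(path, data)]\n",
        "    chunks = []\n",
        "    for key, value in data.items():\n",
        "        chunks.extend(json_split(value, max_chunk_size, path + (key,)))\n",
        "    return chunks\n",
        "\n",
        "doc = {'info': {'title': 'LangSmith', 'version': '0.1.0'},\n",
        "       'paths': {'/runs': {'get': 'list runs', 'post': 'create run'}}}\n",
        "for path, chunk in json_split(doc, max_chunk_size=50):\n",
        "    print(path, json.dumps(chunk))\n",
        "```"
      ],
      "metadata": {}
    },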
    {
      "cell_type": "markdown",
      "source": [
        "### [Semantic Splitting](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html#semanticchunker)\n",
        "\n",
        "Split the text based on semantic similarity."
      ],
      "metadata": {
        "id": "a8Rt52AepNNk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Semantic Chunking\n",
        "\n",
        "!pip install --quiet langchain_experimental langchain_openai\n",
        "\n",
        "import os\n",
        "from langchain_experimental.text_splitter import SemanticChunker\n",
        "from langchain_openai.embeddings import OpenAIEmbeddings\n",
        "\n",
        "# Add OpenAI API key as environment variable\n",
        "os.environ[\"OPENAI_API_KEY\"] = \"sk-****\"\n",
        "\n",
        "with open(\"state_of_the_union.txt\") as f:\n",
        "    state_of_the_union = f.read()\n",
        "\n",
        "text_splitter = SemanticChunker(OpenAIEmbeddings())\n",
        "\n",
        "docs = text_splitter.create_documents([state_of_the_union])\n",
        "print(docs[0].page_content)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "oHDYFAHeOjPA",
        "outputId": "f7a5bbc2-c432-4370-bbf5-e529f4ff8c77"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n",
            "\n",
            "Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Splitting by Tokens\n",
        "\n",
        "[Langchain tiktoken encoder](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html#langchain_text_splitters.character.CharacterTextSplitter.from_tiktoken_encoder)\n",
        "\n",
        "A text splitter that uses OpenAI's tiktoken encoder to count chunk length in tokens."
      ],
      "metadata": {
        "id": "dV7RMi7_pRWn"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Splits by Tokens\n",
        "\n",
        "# Using Tiktoken\n",
        "!pip install --upgrade --quiet tiktoken\n",
        "\n",
        "with open(\"state_of_the_union.txt\") as f:\n",
        "    state_of_the_union = f.read()\n",
        "\n",
        "from langchain_text_splitters import CharacterTextSplitter\n",
        "\n",
        "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
        "    chunk_size=100, chunk_overlap=0\n",
        ")\n",
        "texts = text_splitter.split_text(state_of_the_union)\n",
        "\n",
        "print(texts[0])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7_WVw_kEQmJg",
        "outputId": "93f501ed-d8a4-4350-a670-a6af25d2879d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "WARNING:langchain_text_splitters.base:Created a chunk of size 123, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 104, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 109, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 106, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 129, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 111, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 118, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 132, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 231, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 177, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 112, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 130, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 116, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 184, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 139, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 112, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 151, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 203, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 138, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 123, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 213, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 134, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 130, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 125, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 139, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 111, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 130, which is longer than the specified 100\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 124, which is longer than the specified 100\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n",
            "\n",
            "Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "vMYrBTIvvGEg"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}